172 research outputs found

    Boolean Matrix Factorization Meets Consecutive Ones Property

    No full text
    Boolean matrix factorization is a natural and popular technique for summarizing binary matrices. In this paper, we study a variant of Boolean matrix factorization in which we additionally require that the factor matrices have the consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layouts, where nodes are ordered on a circle or on a line. A common problem when visualizing graphs is clutter due to too many edges. The standard way to deal with this is to bundle edges together and represent them as ribbons. We show that we can use OBMF for edge bundling combined with circular or linear layout techniques. We demonstrate not only that this problem is NP-hard but also that no polynomial-time algorithm can yield a multiplicative approximation guarantee (unless P = NP). On the positive side, we develop a greedy algorithm in which each step looks for the best rank-1 factorization. Since even obtaining a rank-1 factorization is NP-hard, we propose an iterative algorithm that fixes one side, finds the other, reverses the roles, and repeats. We show that this step can be done in linear time using PQ-trees. We also extend the problem to the cyclic ones property and to symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well

    Density-friendly Graph Decomposition

    Full text link
    Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows analyzing a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph. Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G = (V,E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|²|E|) but is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation; however, as demonstrated by our experimental evaluation, in practice k-cores have a different structure than locally-dense subgraphs, and, as predicted by the theory, k-cores are not always well-aligned with graph density
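    The density notion above (average degree |E|/|V|) is the one behind the classic greedy peeling algorithm for densest subgraph, which is also a factor-2 approximation. A minimal sketch, assuming an undirected simple graph given as an edge list (this is the textbook peeling technique, not the paper's locally-dense decomposition algorithm):

```python
from collections import defaultdict

def densest_subgraph(edges):
    """Greedy peeling: repeatedly delete a minimum-degree vertex and
    remember the snapshot with the highest density |E|/|V|."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_density, best_set = 0.0, set(nodes)
    while nodes:
        density = m / len(nodes)
        if density > best_density:
            best_density, best_set = density, set(nodes)
        u = min(nodes, key=lambda x: len(adj[x]))   # min-degree vertex
        for w in adj[u]:
            adj[w].discard(u)                       # remove u's edges
        m -= len(adj[u])
        del adj[u]
        nodes.remove(u)
    return best_set, best_density
```

    Each snapshot's density is recorded before peeling, so the best intermediate subgraph is returned even if later peels destroy it.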

    Interactive and Iterative Discovery of Entity Network Subgraphs

    Get PDF
    Graph mining to extract interesting components has been studied in various guises, e.g., communities, dense subgraphs, cliques. However, most existing works are based on notions of frequency and connectivity and do not capture subjective interestingness from a user's viewpoint. Furthermore, existing approaches to mining graphs are not interactive and cannot incorporate user feedback in any natural manner. In this paper, we address these gaps by proposing a graph maximum entropy model to discover surprising connected subgraph patterns from entity graphs. This model is embedded in an interactive visualization framework to enable human-in-the-loop, model-guided data exploration. Using case studies on real datasets, we demonstrate how interactions between users and the maximum entropy model lead to faster and more explainable conclusions

    Generating Realistic Synthetic Population Datasets

    Get PDF
    Modern studies of societal phenomena rely on the availability of large datasets capturing attributes and activities of synthetic, city-level, populations. For instance, in epidemiology, synthetic population datasets are necessary to study disease propagation and intervention measures before implementation. In social science, synthetic population datasets are needed to understand how policy decisions might affect preferences and behaviors of individuals. In public health, synthetic population datasets are necessary to capture diagnostic and procedural characteristics of patient records without violating the confidentiality of individuals. To generate such datasets over a large set of categorical variables, we propose the use of the maximum entropy principle to formalize a generative model that, in a statistically well-founded way, optimally utilizes given prior information about the data and is unbiased otherwise. An efficient inference algorithm is designed to estimate the maximum entropy model, and we demonstrate how our approach is adept at estimating underlying data distributions. We evaluate this approach on both simulated data and US census datasets, and demonstrate its feasibility using an epidemic simulation application
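    To see the maximum entropy principle at work in its simplest form: when the only prior information is each attribute's marginal distribution, the maximum entropy model factorizes into a product of independent categoricals. A toy sampler under that assumption (the paper's model handles much richer prior information; the attribute names here are made up for illustration):

```python
import random

def sample_population(marginals, n, seed=0):
    """Draw n synthetic records from the maxent model that matches the
    given per-attribute marginals and assumes nothing else, i.e. the
    product of independent categorical distributions."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        rec = {attr: rng.choices(list(dist), weights=list(dist.values()))[0]
               for attr, dist in marginals.items()}
        records.append(rec)
    return records
```

    Adding constraints beyond marginals (e.g. joint frequencies of attribute pairs) breaks this independence and requires the kind of inference algorithm the abstract describes.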

    Fast Likelihood-Based Change Point Detection

    Get PDF
    Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single Bernoulli variable, while the second model divides the stored data into two segments and models each segment with its own Bernoulli variable. Finding the optimal split can be done in O(n) time, where n is the number of entries since the last change point. This is too expensive for large n. To combat this, we propose an approximation scheme that yields a (1 − ε) approximation in O(ε⁻¹ log² n) time. The speed-up consists of several steps: first, we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed Bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size O(ε⁻¹ log n) for the model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality
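    The exact O(n) baseline that the paper speeds up can be sketched directly: for every split point, compare the maximized log-likelihood of the two-segment Bernoulli model against the single-Bernoulli model (a sketch of the baseline scan only, not the paper's approximation scheme):

```python
import math

def seg_ll(h, n):
    """Maximized log-likelihood of n Bernoulli draws containing h ones."""
    ll = 0.0
    if h > 0:
        ll += h * math.log(h / n)
    if h < n:
        ll += (n - h) * math.log((n - h) / n)
    return ll

def best_split(x):
    """Exact O(n) scan over a binary sequence x: maximize the
    log-likelihood ratio of two segments vs. one."""
    n, total = len(x), sum(x)
    base = seg_ll(total, n)
    best_k, best_lr = None, 0.0
    h = 0
    for k in range(1, n):              # split before index k
        h += x[k - 1]                  # ones in the prefix x[:k]
        lr = seg_ll(h, k) + seg_ll(total - h, n - k) - base
        if lr > best_lr:
            best_k, best_lr = k, lr
    return best_k, best_lr
```

    A large ratio at the best split signals a change; the paper's contribution is evaluating far fewer than n candidate splits while losing at most a (1 − ε) factor.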

    Lymphatic endothelium stimulates melanoma metastasis and invasion via MMP14-dependent Notch3 and β1-integrin activation

    Get PDF
    Lymphatic invasion and lymph node metastasis correlate with poor clinical outcome in melanoma. However, the mechanisms of lymphatic dissemination in distant metastasis remain incompletely understood. We show here that exposure of expansively growing human WM852 melanoma cells, but not singly invasive Bowes cells, to lymphatic endothelial cells (LEC) in 3D co-culture facilitates melanoma distant organ metastasis in mice. To dissect the underlying molecular mechanisms, we established LEC co-cultures with different melanoma cells originating from primary tumors or metastases. Notably, the expansively growing metastatic melanoma cells adopted an invasively sprouting phenotype in 3D matrix that was dependent on MMP14, Notch3 and β1-integrin. Unexpectedly, MMP14 was necessary for LEC-induced Notch3 induction and coincident β1-integrin activation. Moreover, MMP14 and Notch3 were required for LEC-mediated metastasis of zebrafish xenografts. This study uncovers a unique mechanism whereby LEC contact promotes melanoma metastasis by inducing a reversible switch from 3D growth to invasively sprouting cell phenotype

    Genome-scaled phylogeny of Saccharomyces cerevisiae from spontaneous must fermentations

    Get PDF
    Modern winemakers commonly inoculate selected S. cerevisiae strains in must to obtain controlled fermentations and reproducible products. However, wine has been produced for thousands of years using spontaneous fermentations from wild strains, a practice that is experiencing a revival among small wine producers. Despite the widespread usage of such strains in the past, much remains unknown about their ecology, evolution and functional potential. For example, the reciprocal affinities of these strains within the S. cerevisiae phylogeny have yet to be discovered, as well as the degree of their biodiversity and their impact on wine terroir. To fill this knowledge gap, we aim at characterising at the strain level the S. cerevisiae present in spontaneously fermented musts sampled across Italy. We set up a protocol based on polyphenol-removing prewashes, followed by whole-genome shotgun sequencing at a depth of 5 Gb of DNA per sample. We performed both an assembly-free analysis to reconstruct the strain-level phylogeny of S. cerevisiae using the species-specific-marker-based StrainPhlAn, and the reconstruction of metagenome-assembled genomes of these strains for downstream functional analyses. To plan conservation acts in a scenario of continuous climate change, we aim at isolating and maintaining strains of interest. We will present preliminary results from the analysis of spontaneous musts sampled at different fermenting stages

    On Coupling FCA and MDL in Pattern Mining

    Get PDF
    Pattern mining is a well-studied field in data mining and machine learning. The modern methods are based on dynamically updated models, among which MDL-based ones ensure high-quality pattern sets. Formal concepts also characterize patterns in a condensed form. In this paper we study the MDL-based algorithm Krimp in the FCA setting and propose a modified version that benefits from FCA and relies on the probabilistic assumptions that underlie MDL. We provide experimental evidence that the proposed approach improves the quality of the pattern sets generated by Krimp
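    The MDL criterion behind Krimp is easy to state in code: once a code table's patterns have been used to cover the database, each pattern receives a Shannon-optimal code of length −log₂(usage/total), and the data cost is the total coded length. A minimal sketch of that data-cost term only (real Krimp adds the cost of encoding the code table itself, which this omits):

```python
import math

def encoded_length(usages):
    """L(D | CT): total coded length, in bits, of a cover whose patterns
    have the given usage counts. Each pattern's code length is
    -log2(usage / total usage)."""
    total = sum(usages)
    return sum(u * -math.log2(u / total) for u in usages if u > 0)
```

    A candidate pattern is accepted only if adding it lowers this quantity (plus the model cost) — the compression-based selection that yields small, non-redundant pattern sets.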

    Choroidal vascularity map in unilateral central serous chorioretinopathy: A comparison with fellow and healthy eyes

    Get PDF
    Background: To map the choroidal vascularity index and compare the two eyes of patients with unilateral central serous chorioretinopathy (CSCR). Methods: This was a retrospective, observational study performed in patients with unilateral CSCR. Choroidal thickness (CT) and choroidal vascularity index (CVI) were measured and mapped in various zones according to the Early Treatment Diabetic Retinopathy Study (ETDRS) grid. Results: A total of 20 CSCR patients (20 study and 20 fellow eyes) were included in the study. Outer nasal region CT was significantly lower than central CT (p = 0.042) and inner nasal CT (p = 0.007); outer ring CT was significantly less than central (p = 0.04) and inner ring (p = 0.01) CT in CSCR eyes. On plotting all the CVI values against the corresponding CT values, a positive correlation was seen in CSCR eyes (r = 0.54, p < 0.01), which was slightly weaker in fellow eyes (r = 0.3, p < 0.01), and a negative correlation was seen in healthy eyes (r = −0.262, p < 0.01). Conclusions: The correlation between CVI and CT was altered in CSCR eyes as compared to fellow and normal eyes, with increasing CVI towards the center of the macula and superiorly in CSCR eyes

    Fast Generation of Best Interval Patterns for Nonmonotonic Constraints

    Get PDF
    In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is the support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, all of its superpatterns are infrequent. However, many other constraints for pattern selection are neither monotonic nor anti-monotonic, which makes it difficult to generate patterns satisfying these constraints. In this paper we introduce the notion of "generalized monotonicity" and the Sofia algorithm, which together allow generating the best patterns in polynomial time for some nonmonotonic constraints, modulo constraint computation and pattern extension operations. In particular, this algorithm is polynomial for data on itemsets and interval tuples. In this paper we consider stability and the delta-measure, which are nonmonotonic constraints, and apply them to interval tuple datasets. In the experiments, we compute the best interval tuple patterns w.r.t. these measures and show the advantage of our approach over postfiltering approaches
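    The anti-monotonicity of support that the abstract contrasts against is exactly what classic levelwise (Apriori-style) mining exploits: a k-itemset can only be frequent if every (k−1)-subset is. A small sketch of that baseline pruning, for comparison with the nonmonotonic constraints Sofia targets (this is the textbook technique, not the Sofia algorithm):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Levelwise search using anti-monotone pruning: a candidate is
    kept only if all of its (k-1)-subsets are already frequent."""
    tsets = [set(t) for t in transactions]
    items = sorted({i for t in tsets for i in t})

    def support(s):
        return sum(s <= t for t in tsets)

    frequent = {frozenset([i]) for i in items if support({i}) >= minsup}
    result = set(frequent)
    k = 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # anti-monotone pruning: every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in result
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result
```

    With a nonmonotonic measure such as stability, this pruning is unsound — a "bad" pattern may still have "good" superpatterns — which is the gap the generalized-monotonicity notion addresses.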